A hearing-inspired approach for distant-microphone speech recognition in the presence of multiple sources
نویسندگان
چکیده
This paper addresses the problem of speech recognition in reverberant multisource noise conditions using distant binaural microphones. Our scheme employs a two-stage fragment decoding approach inspired by Bregman’s account of auditory scene analysis, in which innate primitive grouping ‘rules’ are balanced by the role of learnt schema-driven processes. First, the acoustic mixture is split into local time-frequency fragments of individual sound sources using signal-level primitive grouping cues. Second, statistical models are employed to select fragments belonging to the sound source of interest, and the hypothesis-driven stage simultaneously searches for the most probable speech/background segmentation and the corresponding acoustic model state sequence. The paper reports recent advances in combining adaptive noise floor modelling and binaural localisation cues within this framework. By integrating signal-level grouping cues with acoustic models of the target sound source in a probabilistic framework, the system is able to simultaneously separate and recognise the sound of interest from the mixture, and derive significant recognition performance benefits from different grouping cue estimates despite their inherent unreliability in noisy conditions. Finally, the paper will show that missing data imputation can be applied via fragment decoding to allow reconstruction of a clean spectrogram that can be further processed and used as input to conventional ASR systems. The best performing system achieves an average keyword recognition accuracy of 85.83% on the PASCAL CHiME Challenge task. © 2012 Elsevier Ltd. All rights reserved.
منابع مشابه
Simultaneous recognition of multiple sound sources based on 3-d n-best search using microphone array
The recognition of distant talking speech in a noisy and reverberant environments is key issue in any speech recognition system. A so-called hands-free speech recognition system plays an important role in the natural and friendly human-machine interface. Considering the practical use of a speech recognition system, we realize that such a system has to deal, also, with the case of the presence o...
متن کاملModel-based Sparse Component Analysis for Multiparty Distant Speech Recognition
This thesis takes place in the context of multi-microphone distant speech recognition in multiparty meetings. It addresses the fundamental problem of overlapping speech recognition in reverberant rooms. Motivated from the excellent human hearing performance on such problem, possibly resulting of sparsity of the auditory representation, our work aims at exploiting sparse component analysis in sp...
متن کاملStatistical sound source identification in a real acoustic environment for robust speech recognition using a microphone array
It is very important for a hands-free speech interface to capture distant talking speech with high quality. A microphone array is an ideal candidate for this purpose. However, this approach requires localizing the target talker. Conventional talker localization methods in multiple sound source environments not only have difficulty localizing the multiple sound sources accurately, but also have ...
متن کاملTowards Robust Speech Acquisition using Sensor Arrays
An integrated system approach was developed to address the problem of distant speech acquisition in multi-party meetings, using multiple microphones and cameras. Microphone array processing techniques have presented a potential alternative to close-talking microphones by providing speech enhancement through spatial filtering and directional discrimination. These techniques relied on accurate sp...
متن کاملNeural network based regression for robust overlapping speech recognition using microphone arrays
This paper investigates a neural network based acoustic feature mapping to extract robust features for automatic speech recognition (ASR) of overlapping speech. In our preliminary studies, we trained neural networks to learn the mapping from log mel filter bank energies (MFBEs) extracted from the distant microphone recordings, including multiple overlapping speakers, to log MFBEs extracted from...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Computer Speech & Language
دوره 27 شماره
صفحات -
تاریخ انتشار 2013